Update all usages of fleettools to use the installed Agent ID #7054
base: main
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
There was cloud instability while creating the deployment; restarting tests.
buildkite test this
pkg/testing/fixture.go (outdated)
	defer cancel()

	var lastErr error
	for {
I believe it is more appropriate for any retry logic to be left to the caller to implement if necessary; this function shouldn't implicitly keep retrying to get the status of an agent. wdyt? 🙂
That means everywhere in the testing framework we will have to add retry logic. I am not a fan of that; it will pollute the testing code. Since this is testing code, I thought it would be best to place the logic here.
I was thinking of splitting it into two functions, one with retry and one without, but I couldn't find a single place where I would prefer the function without retry over the function with retry.
Pollution 😄 I never thought of it like that; ok, I will keep it in mind next time.

> I couldn't find a single place where I would prefer the function with no retry over the function with retry.

To me that sounds like you would prefer to always call the one with retry, just to stay on the "safe" side; otherwise you wouldn't introduce it to begin with?!
Ok, let's do what you say then. If you need to retry because the tests currently lack a way to wait for the AgentID to become available, let's try to minimise the "pollution" with a separate call that at least allows the caller to specify the retry knobs, and call that from everywhere 😉
I also think that deciding whether being unable to get an agent ID is a showstopper, or whether retries should be performed (maybe using assert.Eventually() or some other assertion, as in the sketch below), should be up to the specific test.
Also, assuming that this can take up to 1 minute may be wrong depending on the test case.
I would prefer not to have "hidden" mechanisms in the utility functions; if we need to change the test code, so be it: explicit test-case code is preferable in my opinion.
One more thing: what exactly is the case where an installed and enrolled agent does not have an AgentID? Wouldn't that be an issue with either the enroll operation or the test structure?
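As an illustration of that caller-side approach, here is a minimal sketch using testify's require.Eventually (the require variant of the assert.Eventually suggested above); the agentIDer interface stands in for the test fixture and is an assumption, not the framework's actual API:

package example

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// agentIDer is a stand-in for the test fixture; AgentID is assumed to make a
// single attempt with no internal retries.
type agentIDer interface {
	AgentID(ctx context.Context) (string, error)
}

// waitForAgentID keeps the retry decision in the test: the timeout and tick
// are chosen at the call site rather than hidden inside the utility function.
func waitForAgentID(t *testing.T, f agentIDer) string {
	t.Helper()
	var agentID string
	require.Eventually(t, func() bool {
		id, err := f.AgentID(context.Background())
		if err != nil || id == "" {
			return false
		}
		agentID = id
		return true
	}, time.Minute, time.Second, "agent never reported an ID")
	return agentID
}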
I don't know why we would want the same retry logic in every test. Most developers want a DRY method of development, where the same code is not repeated everywhere. Not having it in the function also results in more cases of errors and flakiness in tests, if the developer doesn't add the extra code to ensure that retries are performed. Overall, retry logic in the function provides a cleaner implementation in the test code, which is where we should strive for improved readability, so I would prefer to see this type of logic placed in shared functions.
I have updated the code to allow the caller to disable retries and to adjust the timeout and interval as well (a sketch of that shape follows). I don't see the need for those honestly, but let's see if you like that better.
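A minimal sketch of that shape, using functional options; the Fixture type, agentIDOnce helper, defaults, and option names are illustrative assumptions, not the actual fixture API:

package example

import (
	"context"
	"errors"
	"time"
)

// Fixture and agentIDOnce stand in for the real test fixture; agentIDOnce is
// assumed to make a single status request and return the reported ID.
type Fixture struct{}

func (f *Fixture) agentIDOnce(ctx context.Context) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

type agentIDOpts struct {
	retry    bool
	timeout  time.Duration
	interval time.Duration
}

type AgentIDOpt func(*agentIDOpts)

func WithoutRetry() AgentIDOpt                { return func(o *agentIDOpts) { o.retry = false } }
func WithTimeout(d time.Duration) AgentIDOpt  { return func(o *agentIDOpts) { o.timeout = d } }
func WithInterval(d time.Duration) AgentIDOpt { return func(o *agentIDOpts) { o.interval = d } }

// AgentID retries by default but lets the caller opt out or adjust the knobs.
func (f *Fixture) AgentID(ctx context.Context, opts ...AgentIDOpt) (string, error) {
	o := agentIDOpts{retry: true, timeout: time.Minute, interval: time.Second}
	for _, opt := range opts {
		opt(&o)
	}
	if !o.retry {
		return f.agentIDOnce(ctx)
	}
	ctx, cancel := context.WithTimeout(ctx, o.timeout)
	defer cancel()
	var lastErr error
	for {
		id, err := f.agentIDOnce(ctx)
		if err == nil && id != "" {
			return id, nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return "", errors.Join(ctx.Err(), lastErr)
		case <-time.After(o.interval):
		}
	}
}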
I would actually prefer to move this logic into ExecStatus, because that is where it really belongs. It is very much a retry on a failed connection to a remote GRPC server. It might even be better to place this directly in the elastic-agent status command, but that would not fix previous versions. Given that many of the tests install old versions and upgrade them to the latest, placing it in the command would not help in tests.
Alright, in that case I'm fine with it. Thanks for the explanation! It would be nice to add a comment to the empty-string check to make it clear it is a failsafe.
Incidentally, this isn't related to your change, but how does agent not having an ID but the control protocol server running actually happen?
I have done just that: moved the retry logic into ExecStatus, as that is the appropriate place for it. This is very much about retries for communication with the Elastic Agent daemon, which is a local GRPC server. (A sketch of that shape follows.)
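A minimal sketch of what that looks like; the Exec helper and AgentStatusOutput struct here are assumptions standing in for the framework's actual definitions:

package example

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// Stubs for illustration only; the real fixture, exec helper, and status
// struct live in the testing framework.
type Fixture struct{}

type AgentStatusOutput struct {
	Info struct {
		ID string `json:"id"`
	} `json:"info"`
}

func (f *Fixture) Exec(ctx context.Context, args ...string) ([]byte, error) {
	// Assumed to run the installed elastic-agent binary with the given args.
	return nil, fmt.Errorf("not implemented in this sketch")
}

// ExecStatus runs `elastic-agent status --output=json`, retrying while the
// daemon's local GRPC server is not yet accepting connections.
func (f *Fixture) ExecStatus(ctx context.Context) (AgentStatusOutput, error) {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	var lastErr error
	for {
		out, err := f.Exec(ctx, "status", "--output=json")
		if err == nil {
			var status AgentStatusOutput
			if err = json.Unmarshal(out, &status); err == nil {
				return status, nil
			}
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return AgentStatusOutput{}, fmt.Errorf("ExecStatus never succeeded: %w (last error: %v)", ctx.Err(), lastErr)
		case <-time.After(time.Second):
		}
	}
}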
> Incidentally, this isn't related to your change, but how does agent not having an ID but the control protocol server running actually happen?

Honestly, I just think it could happen; I was just being defensive. The definition of quality software to me is how it handles the unknown. I don't actually know; maybe it is always set.
I have removed that for now. I guess we will see if it becomes an issue. If we start getting failures saying the Agent ID is empty, we will know.
> I have removed that for now. I guess we will see if it becomes an issue. If we start getting failures saying the Agent ID is empty, we will know.
I like that decision. If it can happen, I'd consider it a bug, so a test catching it would be a good thing.
This pull request is now in conflict. Could you fix it? 🙏
Until we discuss as a team what DRY is, when to apply it, and what developers of integration tests should aim for, and all of us collectively agree on a definition, I believe this PR should not be merged.
Happy to discuss.
	if err != nil {
		return "", err
	}
	return status.Info.ID, nil
This is possibly misleading, because even standalone agents have IDs that are then replaced with one generated by Fleet after enrollment succeeds.
For example, I see the following with a local standalone agent. Notice "is_managed": false there, but "id": "913ce739-2c6c-45e9-90f5-2226a14bca70" being populated.
sudo elastic-development-agent status --output=json
{
  "info": {
    "id": "913ce739-2c6c-45e9-90f5-2226a14bca70",
    "version": "9.1.0",
    "commit": "d2047ac48df2f4536ca69a86ad4922b3e264501a",
    "build_time": "2025-02-25 21:52:49 +0000 UTC",
    "snapshot": true,
    "pid": 70294,
    "unprivileged": false,
    "is_managed": false
  },
  "state": 2,
  "message": "Running",
  ...
}
Just looking at the ID at any one point in time is not going to give you a valid ID for making requests to Fleet.
We probably want an explicit entry in the status output for the ID as assigned by Fleet, so we can poll for it to be populated. Otherwise I worry there will be race conditions in tests where the standalone ID is sometimes picked up before it is replaced by the one assigned during enrollment (a sketch of that polling idea follows).
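To illustrate the concern, one possible shape (field names taken from the status output above; the execStatus callback is hypothetical) is to poll and only accept the ID once the agent reports itself as managed:

package example

import (
	"context"
	"encoding/json"
	"time"
)

type statusOutput struct {
	Info struct {
		ID        string `json:"id"`
		IsManaged bool   `json:"is_managed"`
	} `json:"info"`
}

// fleetAgentID polls the status output and only returns the ID once the agent
// reports is_managed: true, so a pre-enrollment standalone ID is never used.
func fleetAgentID(ctx context.Context, execStatus func(context.Context) ([]byte, error)) (string, error) {
	for {
		if out, err := execStatus(ctx); err == nil {
			var st statusOutput
			if json.Unmarshal(out, &st) == nil && st.Info.IsManaged && st.Info.ID != "" {
				return st.Info.ID, nil
			}
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(time.Second):
		}
	}
}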
It is not misleading, but it does greatly depend on when you ask for the AgentID: you must check after enrollment has occurred. You don't need to worry about it picking up the wrong ID, as long as you are calling it at the correct time. I think AgentID() is also useful in the standalone case, so I don't think checking for is_managed: true would be correct for this type of call.
💚 Build Succeeded
cc @blakerouse
What does this PR do?
Updates all integration tests to use the installed Elastic Agent ID from the status output to check with Fleet for information about the specific Elastic Agent.
Why is it important?
This ensures that the tests in the integration framework are only communicating with Fleet about that specific Elastic Agent. It removes the need to filter based on hostname or to do any paging with the Kibana API to find that specific Elastic Agent. Because the test installed the Elastic Agent, we know its ID, and the test should always use that ID (a sketch of such a lookup follows).
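For example, a hedged sketch of looking up one agent directly; Kibana's Fleet API serves agents under /api/fleet/agents, but the base URL, auth scheme, and response handling here are placeholders:

package example

import (
	"fmt"
	"net/http"
)

// getFleetAgent fetches one specific agent by its ID instead of listing and
// filtering agents by hostname.
func getFleetAgent(kibanaURL, apiKey, agentID string) (*http.Response, error) {
	url := fmt.Sprintf("%s/api/fleet/agents/%s", kibanaURL, agentID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "ApiKey "+apiKey)
	req.Header.Set("kbn-xsrf", "true") // conventional Kibana API header
	return http.DefaultClient.Do(req)
}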
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works (all integration tests)
[ ] I have added an entry in ./changelog/fragments using the changelog tool (testing only)
Disruptive User Impact
None
How to test this PR locally
mage integration:test
Related issues